Method and system for network modeling to enlarge the search space of candidate genes for diseases

ABSTRACT

With the advent of low cost, high-throughput whole genome sequencing (“next generation sequencing”), tools are available to assay human genetic variation contributing to inherited disease syndromes. A method is disclosed for prioritization of genetic variants, and identification of disease genes, using network modeling of gene associations.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 61/497,542 filed Jun. 16, 2011, which is hereby incorporated by reference in its entirety for all purposes.

This application is related to U.S. Non-provisional application Ser. Nos. 13/445,923 filed Apr. 13, 2012; 13/445,925 filed Apr. 13, 2012, and 13/445,926 filed Apr. 13, 2012, herein incorporated by reference in their entirety for all purposes.

STATEMENT OF GOVERNMENT SPONSORED SUPPORT

This invention was made with Government support under contracts HL083914 and OD004613 awarded by the National Institutes of Health. The Government has certain rights in this invention.

FIELD OF THE INVENTION

The present invention generally relates to the field of genomics. More particularly, the present invention relates to the field of whole genome sequencing.

BACKGROUND OF THE INVENTION

Despite advances in diagnosis and management, the clinical syndrome of heart failure remains a leading cause of hospitalization and death in developed countries with a worse five year prognosis than most cancers (Jhund P S, Macintyre K, Simpson C R, Lewsey J D, Stewart S, Redpath A, Chalmers J W, Capewell S, McMurray J J. Long-term trends in first hospitalization for heart failure and subsequent survival between 1986 and 2003: a population study of 5.1 million people. Circulation. 2009; 119:515-523; these and all other references cited herein are incorporated by reference for all purposes). Disparate mechanisms of cardiac stress invoke a common pathway of adaptive myocyte hypertrophy which progresses to impaired systolic function and dilated, thinned myocardium. Expression of several genes is known to differ reproducibly between normal adult myocardium and failing and hypertrophied myocardium. Some of these differentially expressed genes, such as cardiac muscle alpha-myosin heavy chain 6 (MYH6) and cardiac muscle beta-myosin heavy chain 7 (MYH7), comprise the contractile apparatus of the sarcomere. Others, such as sarcoplasmic reticulum calcium ATPase 2 (ATP2A2), phospholamban (PLN), and solute carrier family 2, facilitated glucose transporter member 4 (SLC2A4), are principally involved in myocardial calcium cycling and energetics (Feldman A M, Weinberg E O, Ray P E, Lorell B H. Selective changes in cardiac gene expression during compensated hypertrophy and the transition to cardiac decompensation in rats with chronic aortic banding. Circ Res. 1993; 73:184-192; Lyn D, Liu X, Bennett N A, Emmett N L. Gene expression profile in mouse myocardium after ischemia. Physiol Genomics. 2000; 2:93-100; Yue P, Long C S, Austin R, Chang K C, Simpson P C, Massie B M. Post-infarction heart failure in the rat is associated with distinct alterations in cardiac myocyte molecular phenotype. J Mol Cell Cardiol. 1998; 30:1615-1630; Rajabi M, Kassiotis C, Razeghi P, Taegtmeyer H. Return to the fetal gene program protects the stressed heart: a strong hypothesis. Heart Fail Rev. 2007; 12:331-343; Chien K R. Stress pathways and heart failure. Cell. 1999; 98:555-558; Hoshijima M, Chien K R. Mixed signals in heart failure: cancer rules. J Clin Invest. 2002; 109:849-855).

It has thus been proposed that myocardial adaptation involves myosin switch and a switch from fatty acid oxidation to glycolysis as the main metabolic pathway of the cardio myocyte. Similar differential gene expression has been observed in fetal myocardium, leading to a hypothesis that there is a coordinated gene program that overlaps between fetal and failing and hypertrophied myocardium 5. It is, however, unclear whether such changes represent a truly coordinated biological response mediated by common regulators.

The incomplete understanding of these shared genomic response pathways is largely due to statistical and computational limitations inherent to traditional differential expression experiments, in which the expression of a pre-defined set of genes is compared between two different phenotypes. These limitations have been compounded by small sample sizes in previous microarray experiments and have precluded prior unbiased genome-wide evaluation of this hypothesis.

Network based approaches can be used to identify patterns of shared gene expression, regulation, or protein interaction while limiting the number of statistical comparisons and thus the false discovery rate (Jeong H, Mason S P, Barabasi A L, Oltvai Z N. Lethality and centrality in protein networks. Nature. 2001; 411:41-42; Jeong H, Tombor B, Albert R, Oltvai Z N, Barabasi A L. The large-scale organization of metabolic networks. Nature. 2000; 407:651-654). Literature derived networks have been used to reveal potential targets for abrogation of restenotic atherosclerotic coronary artery disease and in doing so recapitulated targets of known therapeutic agents. Other investigators have used gene coexpression analysis to uncover pathways regulated by disease susceptibility genes (Chen Y, Zhu J, Lum P Y, Yang X, Pinto S, MacNeil D J, Zhang C, Lamb J, Edwards S, Sieberts S K, Leonardson A, Castellini L W, Wang S, Champy M F, Zhang B, Emilsson V, Doss S, Ghazalpour A, Horvath S, Drake T A, Lusis A J, Schadt E E. Variations in DNA elucidate molecular networks that cause disease. Nature. 2008; 452:429-435). Candidate causal gene relationships have subsequently been validated with high fidelity in transgenic and gene knock-down experiments (Yang X, Deignan J L, Qi H, Zhu J, Qian S, Zhong J, Torosyan G, Majid S, Falkard B, Kleinhanz R R, Karlsson J, Castellani L W, Mumick S, Wang K, Xie T, Coon M, Zhang C, Estrada-Smith D, Farber C R, Wang S S, van Nas A, Ghazalpour A, Zhang B, Macneil D J, Lamb J R, Dipple K M, Reitman M L, Mehrabian M, Lum P Y, Schadt E E, Lusis A J, Drake T A. Validation of candidate causal genes for obesity that affect shared metabolic pathways and networks. Nat Genet. 2009; 41:415-423; Schadt E E. Molecular networks as sensors and drivers of common human diseases. Nature. 2009; 461:218-223).

SUMMARY OF THE INVENTION

With the advent of low cost, high-throughput whole genome sequencing (“next generation sequencing”), the tools to assay human genetic variation contributing to inherited disease syndromes are now available. Traditional approaches to identifying causative genes and genetic variants in diseased individuals have leveraged inheritance information and the segregation of known genetic markers with disease to pinpoint areas of the genome implicated in disease. Whole genome sequencing allows for assaying genetic variation at every position in the human genome and has the potential to revolutionize the study of inherited diseases. One of the main barriers to the realization of this goal in genome interpretation pipelines is the difficulty in prioritizing the millions of genetic variants in known genes. This is currently addressed in many research settings by identifying variants in genes known previously to be associated with human disease syndromes, an approach which limits the scope of this biological technology. In an embodiment of the present invention, however, a method is described for prioritization of genetic variants, and identification of disease genes, using network modeling of gene associations.

Gene coexpression data, protein interaction, and genetic variation data are rich data-sources for inferring relationships between genes. A method according to an embodiment of the present invention uses these data to create Bayesian networks of causal interactions between genes, with genetic variation as a perturbation from which causality is inferred, and then identifies modular sub-networks from hierarchical clustering of links between genes.

Using known interactions in a curated compendium of disease-gene interactions, a method according to an embodiment of the present invention is disclosed to prioritize genome-wide genetic variants or gene products relevant to Mendelian disease according to their impact and actionability as a scaffold. In a method of the present invention, the subnetworks are searched for genes less than or equal to two steps away from known disease genes.

In an embodiment, this expanded set of putative disease genes is used in annotation of whole genome sequences to restrict the search space for variants with causal implications in inherited disease syndromes as follows: 1) variants in known coding (exons) regions are annotated according to predicted pathogenicity to prioritize genome-wide genetic variants or gene products relevant to Mendelian disease according to their impact and actionability and 2) variants in important non-coding regions (splice sites, 5′ and 3′ un-translated regions, microRNA targets and miRNA sequences targeting the gene, enhancers, promoters, and repressors) are identified.

The present invention has application in at least the following areas: mapping causal gene loci in inherited disease syndromes using whole genome sequence data; creating disease and carrier state predictions from whole genome sequencing of healthy individuals as part of a genome sequence interpretation pipeline; identifying targets for research on the pathophysiology of inherited disease syndromes; and prioritization of genes for designing capture-based sequencing platforms and hybridization-based genotyping technologies for rapid molecular diagnosis of inherited disease syndromes.

Among other things, network analysis techniques allow a more accurate reflection of underlying systems biology to be realized than traditional unidimensional molecular biology approaches. In yet another embodiment of the present invention, using gene coexpression network analysis, the gene expression network topology of cardiac hypertrophy and failure is defined as well as the extent of recapitulation of fetal gene expression programs in failing and hypertrophied adult myocardium. Also, the development of network analysis techniques for this embodiment of the present invention, provides examples for network analysis techniques in other embodiments of the present invention.

In an embodiment of the present invention, unbiased marginal module analysis of gene coexpression networks from all murine myocardial transcript abundance data available in the published literature is used to formally evaluate the hypotheses that there are gene programs unique to fetal myocardium and these gene programs are re-capitulated in myocardial adaptation.

In an embodiment, all myocardial transcript data in the Gene Expression Omnibus (n=1617) was assembled. Since hierarchical analysis revealed species had primacy over disease clustering, this analysis focused on the most complete (murine) dataset (n=478). Using gene expression markers of myocardial, adaptation were members of upregulated modules but not hub genes genes. In an embodiment, ZIC2 was identified as a novel transcription factor associated with coexpression modules common to developing and failing myocardium. Of 50 fetal gene co-expression modules, three (6%) were reproduced in hypertrophied myocardium and seven (14%) were reproduced in failing myocardium. One fetal module was common to both failing and hypertrophied myocardium.

Among other things, network modeling allows systems analysis of cardiovascular development and disease. An embodiment of the present invention revealed specific gene expression modules active during both development and disease and specific candidates for their regulation.

These and other embodiments can be more fully appreciated upon an understanding of the detailed description of the invention as disclosed below in conjunction with the attached figures.

BRIEF DESCRIPTION OF THE DRAWINGS

The following drawings will be used to more fully describe embodiments of the present invention.

FIG. 1 is a chart showing myocardial gene microarray expression data used in an embodiment of the present invention.

FIGS. 2A-2D are charts and graphs that assist the identification of gene coexpression modules specific to fetal tissue according to an embodiment of the present invention.

Shown in FIGS. 3A-3E are charts and graphs demonstrating the reproducibility of developmental gene modules in heart failure and cardiac hypertrophy according to an embodiment of the present invention.

Shown in FIG. 4 is the topology of gene coexpression module common to developing myocardium and hypertrophied and failing myocardium according to an embodiment of the present invention

Shown in FIG. 5 are representations of the reproducibility of higher-order network topology between fetal myocardium and myocardium in murine heart failure and cardiac hypertrophy according to an embodiment of the present invention.

FIG. 6 is a block diagram of a computer system on which the present invention can be implemented.

Shown in FIGS. 7A and 7B are charts demonstrating the normalization of gene expression data using mean absolute deviation method reduces intensity-dependent differences for experiments from different experimental conditions according to an embodiment of the present invention.

Shown in FIGS. 8A-8D are charts that assist in understanding gene expression profiles from different experimental conditions clustered by species according to an embodiment of the present invention.

Shown in FIGS. 9A and 9B are gene coexpression network adjacency soft thresholds for different conditions according to an embodiment of the present invention.

Shown in FIG. 10 are gene coexpression network adjacency soft threshold for different conditions according to an embodiment of the present invention.

Shown in FIG. 11 is a sensitivity analysis for parameters used in the dynamic tree cut algorithm according to an embodiment of the present invention.

Shown in FIGS. 12A-12C are charts and graphs demonstrating the identification and validation of fetal gene modules according to an embodiment of the present invention.

Shown in FIGS. 13A and 13B is a topology of developmental gene coexpression modules containing key marker genes.

FIG. 14 (Table 1) is tabular data regarding over-represented biological processes according to gene ontology (GO) in the fetal module recapitulated in heart failure and hypertrophy according to an embodiment of the present invention.

FIG. 15 (Table 2) is tabular data summarizing genes involved in protein catabolism biological process from fetal module recapitulated in heart failure and hypertrophy according to an embodiment of the present invention.

FIG. 16 is a flowchart depicting a method according to an embodiment of the present invention.

FIG. 17 is a flowchart depicting a method according to an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

Among other things, the present invention relates to methods, techniques, and algorithms that are intended to be implemented in a digital computer system 100 such as generally shown in FIG. 6. Such a digital computer or embedded device is well-known in the art and may include the following.

Computer system 100 may include at least one central processing unit 102 but may include many processors or processing cores. Computer system 100 may further include memory 104 in different forms such as RAM, ROM, hard disk, optical drives, and removable drives that may further include drive controllers and other hardware. Auxiliary storage 112 may also be include that can be similar to memory 104 but may be more remotely incorporated such as in a distributed computer system with distributed memory capabilities.

Computer system 100 may further include at least one output device 108 such as a display unit, video hardware, or other peripherals (e.g., printer). At least one input device 106 may also be included in computer system 100 that may include a pointing device (e.g., mouse), a text input device (e.g., keyboard), or touch screen.

Communications interfaces 114 also form an important aspect of computer system 100 especially where computer system 100 is deployed as a distributed computer system. Computer interfaces 114 may include LAN network adapters, WAN network adapters, wireless interfaces, Bluetooth interfaces, modems and other networking interfaces as currently available and as may be developed in the future.

Computer system 100 may further include other components 116 that may be generally available components as well as specially developed components for implementation of the present invention. Importantly, computer system 100 incorporates various data buses 116 that are intended to allow for communication of the various components of computer system 100. Data buses 116 include, for example, input/output buses and bus controllers.

Indeed, the present invention is not limited to computer system 100 as known at the time of the invention. Instead, the present invention is intended to be deployed in future computer systems with more advanced technology that can make use of all aspects of the present invention. It is expected that computer technology will continue to advance but one of ordinary skill in the art will be able to take the present disclosure and implement the described teachings on the more advanced computers or other digital devices such as mobile telephones or “smart” televisions as they become available. Moreover, the present invention may be implemented on one or more distributed computers. Still further, the present invention may be implemented in various types of software languages including C, C++, and others. Also, one of ordinary skill in the art is familiar with compiling software source code into executable software that may be stored in various forms and in various media (e.g., magnetic, optical, solid state, etc.). One of ordinary skill in the art is familiar with the use of computers and software languages and, with an understanding of the present disclosure, will be able to implement the present teachings for use on a wide variety of computers.

The present disclosure provides a detailed explanation of the present invention with detailed explanations that allow one of ordinary skill in the art to implement the present invention into a computerized method. Certain of these and other details are not included in the present disclosure so as not to detract from the teachings presented herein but it is understood that one of ordinary skill in the at would be familiar with such details.

Methods

Among other things the present invention includes embodiments for incorporating network modeling to enlarge a search space of candidate genes for known diseases as will be described further below.

Interaction Network Generation Methods

In an embodiment of the present invention, a method is disclosed for generating an interaction network. In an embodiment of the present invention, such an interaction network includes an undirected, weighted graph, G, comprising edges, E, representing connections between vertexes, V, which in turn represent genes or protein products.

Among other things, embodiments of the present invention use data sources for developing interaction networks such as the weighted graph G. A source for such data includes gene expression microarray data from normal and diseased human and mammalian model system tissues. Another source includes targeted genome wide genotyping or sequencing of DNA. Still another source includes sequencing-based mRNA quantification such as RNA-seq. Still other sources include: protein-protein interaction data from yeast two-hybrid experiments, protein co-immunoprecipitation, mass spectrometry; curated gene and protein interaction pathways; semantic text mining of the published literature; and curated miRNA binding site and sequence collections; targeted and comprehensive gene-knockout and knock-down experiments. Many other data sources for interaction networks are possible as would be understood by one of ordinary skill in the art upon understanding the teachings of the present disclosure. Indeed, it can be expected that many other data sources will be developed that can be used in accordance with the teachings of the present invention.

Where quantitative data is available, such as in gene coexpression microarray data or sequence-based mRNA quantification data, pairwise covariance is used in an embodiment to define edges, E_(ij), for all genes or proteins ij in the dataset of interest.

Where qualitative interaction data is available, such as in protein-protein interaction data and curated and semantic text mining data, derived pathway data, e.g., Eij=={0,1}, is conditional on whether a described interaction exists.

In an embodiment of the present invention, communities of links are chosen according to the method of Ahn, et al (Ahn Y, Bagrow J, Lehmann S. Link communities reveal multiscale complexity in networks. Nature. 2010; 466:761-764, herein incorporated by reference for all purposes). This method creates hierarchical, overlapping communities of links using a single link, hierarchical clustering of edges of an unweighted, undirected graph describing the topological overlap between all nodes sharing a common neighbor. In an embodiment, edges can be directed using genotype-gene expression information as a Bayesian prior to inform causality.

Methods For Prioritizing Genes Using Interaction Networks

With a network in hand, such as one described above, a method of the present invention queries such network for known trait-associated genes. For example, in an embodiment of the present invention, genes that are within a predetermined path length from known disease are retained for prioritizing genetic variants. For example, in an embodiment of the present invention genes that have path length≦2 from known disease genes are retained for prioritizing genetic variants discovered using the methods described above for the network.

In an embodiment, genes that co-reside with known disease genes in sub-networks or communities discovered via link clustering are prioritized for searching genetic variants discovered using the network methods described above. Among other things, embodiments of the present invention use data sources for disease gene discovery. A source for such data includes targeted and comprehensive DNA re-sequencing (targeted, whole exome, or whole genome sequencing) in subjects with known or unknown trait phenotypes. Another source includes genotyping arrays. Still other sources includes targeted and comprehensive sequencing based RNA quantification (RNA-seq) and Epigenetic data sources such as chromatin immunoprecipitation and sequencing or genotyping, bisulfite sequencing. Many other data sources for disease gene discovery are possible as would be understood by one of ordinary skill in the art upon understanding the teachings of the present disclosure. Indeed, it can be expected that many other data sources will be developed that can be used in accordance with the teachings of the present invention.

In another embodiment, prior probabilities or expectations of pathogenicity of discovered variants are updated according to co-residence in shared sub-networks with disease genes (“guilt by association”). In yet another embodiment, these prior probabilities/expectations are used to update existing prior probabilities based on amino acid change, evolutionary conservation, co-segregation with trait phenotype or enrichment. Among other things, in an embodiment of the present invention, these prior expectations can be used for discrete filtering. Also, in another embodiment, prior probabilities in a Bayesian model of likely pathogenicity for each variant discover in a sequencing project can be generated.

Shown in FIG. 16 is an embodiment of a method according to the present invention. As shown, at step 1602, network data sources are received. In an embodiment of the present invention this includes gene expression microarray data from normal and diseased human and mammalian model system tissues. Another source includes targeted genome wide genotyping or sequencing of DNA. Still another source includes sequencing-based mRNA quantification such as RNA-seq. Still other sources include: protein-protein interaction data from yeast two-hybrid experiments, protein co-immunoprecipitation, mass spectrometry; curated gene and protein interaction pathways; semantic text mining of the published literature; and curated miRNA binding site and sequence collections; targeted and comprehensive gene-knockout and knock-down experiments. Many other data sources for interaction networks are possible as would be understood by one of ordinary skill in the art upon understanding the teachings of the present disclosure. Indeed, it can be expected that many other data sources will be developed that can be used in accordance with the teachings of the present invention.

From the network data sources, networks of interest are generated at step 1604. Such networks can be generated as generally described above, as described more fully further below, or according to compatible methods known to those of ordinary skill in the art.

From the generated network, at step 1606, known genes are identified. Among other things, in an embodiment of the present invention, known disease genes are also identified. Sources for known disease genes include curated sets of genes associated with Mendelian disorders. Another source of known disease data includes sets of genes discovered in complex trait, common phenotype associations (e.g., genome-wide association studies, candidate gene studies, and targeted re-sequencing discovery using exome and whole genome sequencing data). Many other data sources for known disease genese are possible as would be understood by one of ordinary skill in the art upon understanding the teachings of the present disclosure. Indeed, it can be expected that many other data sources will be developed that can be used in accordance with the teachings of the present invention.

At step 1608, genetic variation data is received. It should be noted, however, that the method 1600 of FIG. 16 as well as other methods described in the present disclosure are not necessarily intended to be performed in the order presented. Indeed, those of ordinary skill in the art would understand that certain steps can be performed before or after certain other steps. Moreover, those of ordinary skill in the art will understand that certain steps of the methods described herein can be performed in a pipelined or parallel fashion. Indeed, certain of the steps of the present invention can be performed using computational techniques where there order can be a combination of serial and parallel processing.

Using the genetic variation data as well as the identification of known diseases, genetic variants are identified at step 1610 that are in close proximity to known trait genes. In an embodiment of the present invention, genes that have a path length within a predetermined range are identified. In an embodiment, the predetermined range is a path length≦2 from known disease genes.

At step 1612, in an embodiment of the present invention, identified genes are prioritized for searching genetic variants. For example, in an embodiment of the present invention, genes that co-reside with known disease genes in sub-networks or communities discovered via link clustering are prioritized for searching genetic variants discovered using methods for disease gene discovery, for example.

Using known interactions in a curated compendium of disease-gene interactions, a method according to an embodiment prioritizes genome-wide genetic variants or gene products relevant to Mendelian disease according to their impact and actionability as a scaffold.

In an embodiment, the expanded set of putative disease genes is used in annotation of whole genome sequences to restrict the search space for variants with causal implications in inherited disease syndromes. For example, in an embodiment, variants in known coding (exons) regions are annotated according to predicted pathogenicity to prioritize genome-wide genetic variants or gene products relevant to Mendelian disease according to their impact and actionability and variants in important non-coding regions (splice sites, 5′ and 3′ un-translated regions, microRNA targets and miRNA sequences targeting the gene, enhancers, promoters, and repressors) are identified.

Applications

Population wide sequencing is a near-term reality, and the ability to interpret genome sequences from individuals will determine the commercial viability of this technology for advancing personalized health care. Embodiments of the present invention provide solutions for identifying candidate disease genes in individuals with inherited disease syndromes, and also for identification of risk alleles and carrier states in disease genes in healthy individuals.

Among other things, the present invention can be used for mapping causal gene loci in inherited disease syndromes using whole genome sequence data. The present invention can also be used for creating disease and carrier state predictions from whole genome sequencing of healthy individuals as part of a genome sequence interpretation pipeline. Also, embodiments of the present invention can be used for identifying targets for research on the pathophysiology of inherited disease syndromes. The teachings of the present invention can also be used for prioritization of genes for designing capture-based sequencing platforms and hybridization-based genotyping technologies for rapid molecular diagnosis of inherited disease syndromes. Many other applications are possible as would be understood by those of ordinary skill in the art.

Alternative Embodiments

Variations for the methods of the present invention are many. For example, hierarchical link clustering as implemented in embodiments of the present invention can be computationally intensive. To address this issue, hierarchical clustering of genes based on topological overlap can be implemented where topological overlap is a similarity measure that captures the degree of shared network neighbors. Also, based on results of “leave-one-out” cross-validation approach, the threshold for step length (degree of separation between network neighbors) in identification of genes related to known disease genes can be modified. Also, clustering clinical phenotypes based on shared genetic and clinical features can be implemented so as to enlarge the starting scaffold set of disease genes.

Network Methods

Embodiments of the present invention use network analysis techniques to allow for more accurate reflection of underlying systems biology to be realized than traditional unidimensional molecular biology approaches. So as to provide an example of network analysis and its development for use in other embodiments of the present invention, a concrete example will be provided further below where, using gene coexpression network analysis, a gene expression network topology of cardiac hypertrophy is developed.

Shown in FIG. 17 is a flowchart depicting a method 1700 for network analysis according to an embodiment of the present invention. As shown, at step 1702, a database was queried using terms of interest. In an embodiment of the present invention, the database contains all microarray data that are presented in the published literature. In other embodiments, different databases can be used. In an embodiment of the present invention to be described further below, the terms heart, myocardium, and cardiovascular to obtain all datasets describing microarray experiments involving myocardial tissue. In other embodiments, the search terms would vary according to the specific application. At step 1704, certain retrieved information may be excluded according to the qualities of the information that is retrieved.

At step 1706, data combination and normalization is performed. In an embodiment, this step allows for the comparison of gene expression data between different platforms and experimental conditions. Next, at step 1708, weighted gene coexpression network analysis is performed. In an embodiment, this is done to identify groups of genes with similar patterns of expression across experimental samples. These groups of genes are also called modules.

At step 1710, the reproducibility of modules is then evaluated. In an embodiment, this is done by calculating the extent of intramodular connectivity preservation. Afterwards, at step 1712, cluster validation is performed.

In an embodiment of the present invention addressing cardiac issues, at step 1714, conservation of fetal modules was evaluated. In an embodiment, random permutation are used to assess the reproducibility of fetal modules in normal adult datasets.

A transcription factor analysis is performed at step 1616. In an embodiment, gene coexpression modules were analyzed for over-representation of known transcription factor targets. At step 1618, scaled gene expression profile modules are summarized. In an embodiment, a first singular vector (called an eigengene) is used for this step. The eigengene has been found to be equivalent to the first principal component and explains the largest proportion of variance of the module genes.

Details of method 1600 will be described in the further description below in conjunction with the figures. The network analysis techniques described below will, however, also be applicable to the methods and embodiments described further above.

Data Acquisition and Normalization

The Gene Expression Omnibus (GEO, http://www.ncbi.nlm.nih.gov/geo/) is a publicly available repository of all microarray data that are presented in the published literature. This database was queried using the terms heart, myocardium, and cardiovascular to obtain all datasets describing microarray experiments involving myocardial tissue (n=2546). Because absolute expression data is required for weighted gene coexpression network analysis, log transformed dual dye datasets were excluded (n=43). Of these datasets, 1617 contained information regarding experimental conditions, species and a phenotype identifiable as developmental (fetal or post natal days 0 to 10 without experimental manipulation of myocardium), normal adult (adult tissue with no experimental manipulation of myocardium), hypertrophy (histologically or anatomically identifiable cardiac hypertrophy), or heart failure (clinical syndrome of heart failure and impaired cardiac systolic and/or diastolic dysfunction).

Data combination and normalization was performed to allow comparison of gene expression data between different platforms and experimental conditions. So as not to distract from the present teachings, further details of an embodiment for data combination and normalization are included in a Supplementary Methods section below.

Summary statistics showing minimization of bias with this method of normalization are provided in FIG. 7. As shown in FIG. 8, gene array experiments clustered by species, suggesting a species dependency for gene coexpression. Therefore, the present analysis focuses on the species with the most complete dataset, choosing the 757 arrays performed in mouse myocardium. An additional 255 experiments involved transgenic deletion, insertion, or mutation of genes important to myocardial function and/or myocardial gene expression and were thus excluded. As physiological hypertrophy is a state distinct from pathological hypertrophy that does not progress to heart failure, 24 datasets describing hysiological hypertrophy induced by chronic exercise were excluded (see FIG. 1). The final dataset contained 12,620 genes represented in 478 samples.

FIG. 1 shows myocardial gene microarray expression data used in an embodiment of the present invention. As shown, all murine microarray data from the Gene Expression Omnibus (GEO) database were queried with the search terms heart, myocardium, and cardiovascular.

Weighted Gene Coexpression Network Analysis

Weighted gene coexpression network analysis was performed to identify gene coexpression modules, that is, groups of genes with similar patterns of expression across experimental samples (Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005; 4:Article17). Gene coexpression networks are defined in this work as a weighted graph in which genes are represented as vertices and connections strengths, or adjacencies, are represented as edges between vertices. These gene modules were first identified in fetal myocardial tissue, and their reproducibility was evaluated in normal adult, hypertrophied, and failing myocardial tissue samples. Gene modules were derived in a training set of 43 randomly chosen fetal samples (67% of all fetal samples). The remaining 21 fetal samples used for module validation were not used for this initial module identification. To carry out module detection, agglomerative hierarchical clustering of adjacencies given by the topological overlap measure was used. The topological overlap measure is a network adjacency based on shared network neighbors, which is in turn based on weighted Pearson correlation between gene expression profiles. This network adjacency is robust to noise and amplifies connectivity patterns in sparse networks (Yip A M, Horvath S. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics. 2007; 8:22). Separate adjacencies were constructed for each phenotype considered. To cut branches of the tree into gene modules, the dynamic tree cut algorithm was used which iteratively searches for stable branch size and chooses clusters based on the shape of each tree branch (Langfelder P, Zhang B, Horvath S. Defining clusters from a hierarchical cluster tree: the Dynamic Tree Cut package for R. Bioinformatics. 2008; 24:719-720). This algorithm allows manipulation of several parameters controlling resultant cluster size and cohesiveness; sensitivity analysis was performed for each of these parameters. The modules from the fetal network were then carried forward for all subsequent analysis. Intramodular connectivity kin was defined for each gene as the sum of adjacencies with all other genes in the module to which it was assigned. A hub gene was then identified for each module as the gene with the single greatest kin. Module composition was further assessed according to defined gene ontologies (GO) using the DAVID analysis tool (Huang da W, Sherman B T, Lempicki R A. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat Protoc. 2009; 4:44-57; Dennis G, Jr., Sherman B T, Hosack D A, Yang J, Gao W, Lane H C, Lempicki R A. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biol. 2003; 4:P3). All p values reported for GO analysis were corrected for multiple comparisons using the Benjamini correction and both corrected and uncorrected p values are reported. Networks were visualized using the visANT network visualization tool (Hu Z, Hung J H, Wang Y, Chang Y C, Huang C L, Huyck M, Delisi C. VisANT 3.5: multiscale network visualization, analysis and inference based on the gene ontology. Nucleic Acids Res. 2009; 37:W 115-21).

Evaluating Reproducibility of Gene Modules

To evaluate reproducibility of modules between networks, the extent of intramodular connectivity preservation was calculated. To this end, for each module, the correlation Mpres=cor(k^(l),k^(m)) between intramodular connectivity vectors k^(l) and k^(m) was calculated. These vectors described intramodular connectivities kin for genes within each module across two networks N_(l) and N_(m) The closer M_(pres) is to 1, the better preserved is the module structure across networks N_(l) and N_(m). An iterative significance test was used to assign a p value to this measure of module reproducibility between networks N_(l) and N_(m). The null hypothesis for this significance test was that derived modules were preserved between networks N_(I) and N_(m) no better than modules derived from random clustering. To evaluate this hypothesis, gene module identifiers were randomly permuted such that gene modules of the same size but random gene composition were generated. Ten thousand such permutations were performed and M_(rand), the module preservation of each random module assignment, was calculated for each permutation. The probability of the null hypothesis p was then calculated as the proportion of permutations in which the preservation of random modules between networks N_(l) and N_(m) was greater than that of derived modules (M_(rand)>M_(pres)). This significance test was used for module internal validation and evaluation of fetal module reproducibility in adult normal, failing, and hypertrophied myocardial tissue.

Cluster Validation

Clusters were first internally validated in 21 samples not used for module derivation. To prevent spurious inference of association from repeated sampling across the set of gene modules, the Bonferroni correction was employed in each comparison to determine significance of p values derived via permutation derived significance testing described above. Thus, module internal validity was inferred from a p value of less than 5.5×10⁴ for 90 comparisons.

Conservation of Fetal Modules in Normal Adults, Heart Failure, and Hypertrophy

Random permutation was again used to assess the reproducibility of fetal modules in normal adult datasets. Modules with p>6.9×10⁻⁴ (corrected for 72 comparisons) were defined as specific to fetal tissue. To identify additional modules with conserved topology but significant differential expression between fetal and normal adult myocardium, significance analysis of microarrays was used to evaluate gene-wise differential expression between fetal and normal adult myocardium (Storey J D, Tibshirani R. Statistical methods for identifying differentially expressed genes in DNA microarrays. Methods Mol Biol. 2003; 224:149-157). A d-score was assigned to each gene that quantified the significance of its differential expression between phenotypes. For each module, an average d score was calculated by dividing this aggregate d score by the number of genes in the module. A p value was assigned to this average d score via a permutation-derived significance test. The p value for each average modular d score was given by the proportion of ten thousand permutations in which random gene modules of the same size were associated with a greater average d score than derived clusters. Significant modular differential expression was inferred from p<0.002 (corrected for 26 comparisons). This set of modules together with the set of modules with topology specific to fetal tissue was thus defined as the “fetal gene program” and evaluated further for reproducibility in hypertrophied and failing myocardial tissue using the iterative module reproducibility algorithm described above. Fetal module reproducibility in heart failure and hypertrophy was inferred from a p value of 0.001 for 50 comparisons.

Transcription Factor Analysis

Gene coexpression modules were next analyzed for over-representation of known transcription factor targets. Each gene g was linked to a transcription factor if it was identified as a target of that transcription factor. For each transcription factor t in each co-expression module, the hypothesis that transcription factor t was over-represented in the list of genes comprising the module when compared to what was expected by chance alone was tested. This association was measured using the hypergeometric distribution as the probability of observing k or more genes targeted by t transcription factors in the set of n genes, when there are K genes targeted by transcription factor t in the whole set of N genes. Given that 56 total transcription factors were analyzed in the module shared between fetal and hypertrophied and failing myocardium, significance of association was inferred from a p value less than 9×10⁻⁴.

Module Membership of Key Genes and Higher Order Topology of Fetal Gene Transcriptome

To summarize the scaled gene expression profiles modules, the first singular vector (referred to as the module eigengene) that resulted from singular value decomposition was used. The module eigengene is equivalent to the first principal component and explains the largest proportion of the variance of the module genes. Based on the correlations between the module eigengenes of different modules, an eigengene network was constructed that can be used to assess higher order topology of gene coexpression (Langfelder P, Horvath S. Eigengene networks for studying the relationships between co-expression modules. BMC Syst Biol. 2007; 1:54). In this way the gene coexpression network was reduced from thousands of genes to several representative eigengenes that in turn form nodes in a meta-network describing higher-order topology of the transcriptome. Similar eigengenes were calculated for modules in hypertrophied and failing myocardium. To evaluate higher-order reproducibility of fetal gene modules in myocardial adaptation, an adjacency matrix was calculated for eigengene networks in each phenotype considered via the Pearson correlation between eigengenes across all microarrays representing each phenotype. Preservation of inter-modular connections was then assessed via the preservation density, which was calculated as follows:

$\begin{matrix} {{D\left( {preserve}^{1,2} \right)} = {1 - \frac{\sum\limits_{i}\; {\sum\limits_{j \neq i}\; {{a_{ij}^{1} - a_{ij}^{2}}}}}{n\left( {n - 1} \right)}}} & (3) \end{matrix}$

Where D is the preservation density, a_(ij) ¹ and a_(ij) ² are adjacencies for each phenotype, and n is the total number of genes. The closer the preservation density is to 1, the stronger the evidence that the connectivity pattern is preserved between the two networks.

A measure of module membership kME_(i)=cor(x_(i), ME) can be defined as the correlation between expression profile of gene i and the module eigengene. The closer the absolute value of kME_(i) is to 1, the stronger the evidence that the gene belongs to the module represented by the module eigengene (Horvath S, Dong J. Geometric interpretation of gene coexpression network analysis. PLoS Comput Biol. 2008; 4:e1000117). Therefore, marker genes not assigned to fetal modules (MYH6, PLN) were re-assigned to the module with highest kME. Module membership of other marker genes (NPPA, MYH7, SLC2A4) was simply the module assignment of each respective gene as given by the algorithm described previously.

Supplementary Methods

Data Combination and Normalization

Prior to data analysis, all 1617 arrays had to be appropriately combined such that probeset identifiers matched to homologous genes. Probeset annotation files for each array platform were used to identify the gene symbol associated with each probeset and this gene symbol was mapped to a unique identifier for homologous gene groups provided by the homologene database (www.ncbi.nlm.nih.gov/HomoloGene). Data were then combined from all experiments by matching homologene identifiers. To allow for comparison of absolute gene expression data between phenotypes across different experimental data sets, normalization of raw expression data was performed prior to all analyses. Median absolute deviation normalization of log transformed expression values was chosen for its performance advantages in sparse datasets (Barbacioru C C, Wang Y, Canales R D, Sun Y A, Keys D N, Chan F, Poulter K A, Samaha R R. Effect of various normalization methods on Applied Biosystems expression array system data. BMC Bioinformatics. 2006; 7:533). Via this method, intensity-dependent differences in expression level were minimized for between-sample comparisons, as evidenced by M-A plot (See FIG. 7). Of note, the Pearson correlation is invariant under data transformation of location and scale and thus this normalization step does not affect gene expression network analysis.

Shown in FIGS. 7A and B are charts demonstrating the normalization of gene expression data using mean absolute deviation method reduces intensity-dependent differences for experiments from different experimental conditions according to an embodiment of the present invention. FIGS. 7A.1 and 7A.2 depict the relationship between mean and difference between experimental arrays for all experiments with the same experimental set (e.g., FIG. 7A.1 with similar laboratory conditions) and for experiments not within the same experimental set (e.g., FIG. 7A.2 with different laboratory conditions) before performing normalization. Horizontal scatter envelopes centered at 0 indicate minimal intensity-dependent differences in expression level. Shown in FIGS. 7B.1 and 7B.2 are the same relationships after performing normalization. FIGS. 7B.1 and 7B.2 indicate minimization of intensity-dependent differences in expression among disparate experimental conditions.

Species Specificity

Species specificity of gene coexpression was analyzed to assess for primacy of species over phenotype. Experiments were clustered according to shared expression profiles using agglomerative hierarchical clustering. As shown in FIG. 8, array experiments were tightly coordinated by species, indicating a significant species dependency for gene coexpression. As such, datasets from species with the most complete data set with regard to phenotype, mouse, were chosen for further analysis.

Shown in FIGS. 8A-8D are charts that assist in understanding gene expression profiles from different experimental conditions clustered by species according to an embodiment of the present invention. Hierarchical clustering of experimental conditions was performed to evaluate for species-specific patterns of gene coexpression. The top half of each figure depicts the hierarchical clustering of experiments within each experimental phenotype and the bottom half contains a color-keyed representation of species identifiers for each experiment. In normal adult (FIG. 8A), fetal (FIG. 8B), failing (FIG. 8C), and hypertrophied (FIG. 8D) cardiac tissue experiments clustered tightly by species, indicating strong species dependency of gene coexpression. Abbreviations are: H=human, M=mouse, R=rat, D=dog.

Forming Networks

To perform network modeling, weighted gene coexpression analysis was used as described in Zhang et al (Zhang B, Horvath S. A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol. 2005; 4:Article17). The vertices of a weighted gene co-expression graph represent genes and the connection strengths (adjacency) between them comprise edges that connect vertices. The connectivity k_(i) of the i-th vertex (gene) is defined as the sum of its adjacencies with the other nodes in the network. The frequency distribution of the connectivity often (but not always) follows a power law, which is referred to as scale free topology. Many (but not all) biologic networks have been found to satisfy scale free topology (Barabasi A L, Albert R. Emergence of scaling in random networks. Science. 1999; 286:509-512; Jeong H, Tombor B, Albert R, Oltvai Z N, Barabasi A L. The large-scale organization of metabolic networks. Nature. 2000; 407:651-654). The connection strength or adjacency was determined by the Pearson correlation of the corresponding expression profiles.

Specifically, the connection strength (adjacency)

A _(ij) =|cor(x _(i) ,x _(j))|^(b)

between expression profiles i and j is a power of the absolute value of their Pearson correlation. Forming the power of a correlation amounts to a soft-thresholding approach that emphasizes large correlations at the expense of low correlations and obviates the need for arbitrary thresholds for adjacency, thus preserving the continuous nature of the co-expression information. The extent to which a network satisfies scale free topology can be measured using a model fitting index R². FIGS. 9 and 10 show how the scale free topology model fitting index R² (y-axis) depends on the soft threshold b (x-axis).

Shown in FIGS. 9A and 9B are gene coexpression network adjacency soft thresholds for different conditions according to an embodiment of the present invention. FIG. 9. Shown in FIG. 9A is a gene coexpression network adjacency soft threshold for normal adult was chosen to approximate scale-free topology. Shown in FIG. 9A is a gene coexpression network adjacency soft threshold for fetal cardiac tissue was chosen to approximate scale-free topology. FIGS. 9A.1 and 9B.1 depict the relationship between soft-thresholding coefficient (x axis, numbered in red) and the r value for the linear regression (y axis, slope constrained to −1) between the logarithm of the connectivity and the logarithm of the proportion of genes with that connectivity. FIGS. 9B.1 and 9B.2 depict the relationship between the soft-thresholding coefficient and mean connectivity.

Shown in FIG. 10 are gene coexpression network adjacency soft threshold for different conditions according to an embodiment of the present invention. Failing (FIGS. 10A.1 and 10A.2) and hypertrophied cardiac tissue (FIGS. 10B.1 and 10B.2) were chosen to approximate scale-free topology. FIGS. 10A.1 and 10B.1 depict the relationship between soft-thresholding coefficient (x axis, numbered in red) and the r value for the linear regression (y axis, slope constrained to −1) between the logarithm of the connectivity and the logarithm of the proportion of genes with that connectivity. FIGS. 10B.1 and 10B.2 depict the relationship between the soft-thresholding coefficient and mean connectivity.

The factor b was chosen according to the scale free topology criterion, which amounts to choosing the smallest b that leads to approximate scale free topology. To facilitate comparisons, b=9 was chosen for all networks. A major advantage of weighted co-expression networks is that they are statistically highly robust with regard to the chosen threshold b. The final adjacency was further defined using the topological overlap (Yip A M, Horvath S. Gene network interconnectedness and the generalized topological overlap measure. BMC Bioinformatics. 2007; 8:22). This network adjacency describes the commonality of network neighbors for genes i and j and is given by:

$\begin{matrix} {l_{ij} = \left\{ {{{\begin{matrix} \frac{I_{ij} + a_{ij}}{{\min \left\{ {k_{i},k_{j}} \right\}} + 1 - a_{ij}} & {{{if}\mspace{14mu} i} \neq j} \\ 1 & {{{if}\mspace{14mu} i} = j} \end{matrix}{where}I_{ij}} = {\sum\limits_{{u \neq i},j}\; {a_{iu}a_{uj}}}},{k_{i} = {\sum\limits_{u \neq i}\; a_{iu}}},{{{and}k_{j}} = {\sum\limits_{u \neq j}\; a_{uj}}}} \right.} & (1) \end{matrix}$

for a_(iu) and a_(uj) given by soft-thresholded Pearson correlation as described above. For all t_(ij) a distance d_(ij)=1−t_(ij) was defined, giving a symmetrical distance matrix of 12620 rows and 12620 columns. This distance matrix was used as the input for agglomerative hierarchical clustering in the training set of 43 fetal samples and subsequently in a validation set of 21 fetal samples and in other phenotypes. Separate distance matrices were constructed for each phenotype and for training and validation fetal datasets. The dynamic tree cut algorithm was used to define gene modules from the resultant hierarchical clustering. For this algorithm, minimum cluster size was set at 25, the “hybrid” tree cut algorithm was chosen, deep split was set to one, and no minimum or maximum absolute core scatter were defined; sensitivity analysis for the parameters chosen for this function is described in FIG. 11.

Shown in FIG. 11 is a sensitivity analysis for parameters used in the dynamic tree cut algorithm according to an embodiment of the present invention. The z-axis describes module reproducibility average over all derived modules, which is in turn given by the Pearson correlation between vectors describing intra-modular connectivity for each gene. In an embodiment, the chosen parameters for the dynamic tree cut algorithm (deepsplit=1, labelunlabeled=TRUE) gave the highest average reproducibility of modules in the fetal dataset.

The 90 resulting networks were subsequently internally validated using an iterative approach as described above in the validation set (see FIG. 12). The degree to which topology of these validated modules was shared between phenotypes was then evaluated using the same iterative approach. To illustrate topology of key marker genes previously identified to be associated with cardiac development and disease, gene coepxression topology of networks containing these genes is displayed in FIG. 13.

Shown in FIGS. 12A-12C are charts and graphs demonstrating the identification and validation of fetal gene modules according to an embodiment of the present invention. In an embodiment, genes were clustered first by topological overlap in the training dataset (n=43 samples) and then evaluated for significant reproducibility in the validation dataset (n=21 samples). Shown in FIG. 12A is a cluster dendrogram and barplot describing module membership for each of the 12,620 genes. Shown in FIG. 12B is a cluster dendrogram and barplot describing module membership for each of the 12,620 genes (as defined in the training set) in the validation dataset. Shown in FIG. 12C is a significance (−log p value) of module reproducibility in the validation set. The cutoff for statistical significance (p<5.5×10-4 or −log(p value)>3.26) is shown. Seventy-two modules met this criterion.

Shown in FIGS. 13A and 13B is a topology of developmental gene coexpression modules containing key marker genes. The hub genes for modules containing NPPA and MYH7 (FIG. 13A) and SLC2A4 (FIG. 13B) are depicted by an “X”, while marker genes are depicted by a “+.” Only edges with adjacency weight>0.7 are shown for clarity. The proximity of each gene to the center of the figure indicates its connectivity, or sum of connection weights.

Transcription Factor Target Identification

Transcription factor target identification was performed to assess for common transcription factor modulation of gene coexpression modules. The phylogenetic footprinting approach was applied to determine putative gene targets for a large set of vertebrate transcription factors in the mouse genome (Levy S, Hannenhalli S. Identification of transcription factor binding sites in the human genome sequence. Mamm Genome. 2002; 13:510-514). All eukaryotic transcription factor motifs were first clustered from TRANSFAC v10.2 into 235 representative motifs as described (Wingender E, Dietze P, Karas H, Knuppel R. TRANSFAC: a database on transcription factors and their DNA binding sites. Nucleic Acids Res. 1996; 24:238-241; Hannenhalli S, Putt M E, Gilmore J M, Wang J, Parmacek M S, Epstein J A, Morrisey E E, Margulies K B, Cappola T P. Transcriptional genomics associates FOX transcription factors with human heart failure. Circulation. 2006; 114:1269-1276). For each of the 235 motifs provided in the TRANSFAC database, the mouse 1 kb promoter regions were scanned to first identify match with a p-value≦0.0002. These matches were further filtered to only include the binding sites that are either 80% conserved between human and mouse genomes based on the genome alignment provided in UCSC database or the match p-value≦0.00002. In a cross validation with experimentally verified binding sites, these criteria yielded a false positive rate of 1 in 50 kB of genomic data searched, or a false discovery rate of 1 transcription factor for every 50 genes searched in the algorithm according to an embodiment of the present invention.

Results

Network Topology can be Defined for Large Datasets of Myocardial Gene Expression Data

Fetal gene programs were first defined as modules of gene co-expression specific to fetal myocardial tissue. To this end, weighted gene co-expression networks were constructed for fetal myocardial tissue using every murine transcript abundance dataset available in the Gene Expression Omnibus (GEO) database. A data search strategy and summary of experimental conditions according to an embodiment of the present invention is presented in FIG. 1. Modules were internally validated using a data-splitting technique in which 43 (67%) samples were used as a training set and 21 (33%) samples were used to validate the derived modules. In the training set, hierarchical clustering of the topological overlap matrix identified 90 modules composed of between 25 and 1309 genes. A summary of analyzed genes and modules is presented further below. The preservation of all gene-gene connections between training and validation datasets as defined by the preservation density was 94%. Of the 90 original modules, 72 had network topology that was significantly preserved in the validation set (p<5.5×10⁻⁴ for reproducibility) and were carried forward for all subsequent analysis.

Gene Expression in the Developing Heart Involves Modules not Conserved in Normal Adults

Because there are modules of gene expression that represent constitutive genetic programs of cardiac tissue, the subset of these validated modules was next identified that were not reproduced in normal adult myocardium (n=273 samples). Of the 72 valid modules identified in fetal myocardium, 26 were also found in normal myocardium (p<6.9×10-4 for reproducibility between fetal and normal datasets, FIG. 2). Four of these 26 modules were reproduced in adult myocardium but also exhibited significant differential expression (p value for averagedifferential expression<0.002) as a module. Thus, 50 modules of between 25 and 914 genes were found to comprise the gene expression program of the developing heart.

Shown in FIG. 2 is the identification of gene coexpression modules specific to fetal tissue. Gene-gene adjacencies were first defined by the topological overlap, which quantifies the degree of shared network neighbors given by gene-wise Pearson correlation. Hierarchical clustering of adjacencies was first performed in the training dataset of 43 fetal samples and resulting coexpression modules were evaluated for significant reproducibility in the validation dataset of 21 fetal samples. The seventy-two valid fetal modules were evaluated for reproducibility in normal adult myocardium as well as differential expression. Shown in FIG. 2A is a cluster dendrogram and barplot describing module membership (according to clusters derived in the training fetal dataset) for normal adult myocardium. Shown in FIG. 2B is a representation of the reproducibility of valid fetal modules in normal adult myocardium. The horizontal line indicates cutoff for statistical significance (p<6.9×10-4 or −log(p value)>3.16). Modules below this cutoff were defined as having topology specific to fetal myocardium and defined as fetal modules. Shown in FIG. 2C is a heatmap of average expression of each module in fetal and normal adult myocardium. Shown in FIG. 2D is the average significance level of differential expression (average Significance Analysis of Microarrays d score) for each module according to an embodiment of the present invention. Negative values correspond to lower expression levels in normal than fetal myocardium.

Known Marker Genes for Heart Failure are Members but not Hubs of Conserved Modules

The composition of these conserved modules was next examined using annotated gene ontologies. Topology of gene modules containing key marker genes is described in FIG. 13. Matching expected results, MYH7 and NPPA were members of a module up-regulated in fetal tissue compared to normal tissue (156 members, average d score 4.66). SLC2A4 was a member of a module down-regulated in fetal compared to normal tissue (64 members, average d score −4.95). Though not contained in modules specific to fetal cardiac tissue, MYH6 and PLN expression most closely aligned with members of a module down-regulated in fetal compared to normal tissue (90 members, average d score −1.6).

Limited Recapitulation of Fetal Modules in Failing Myocardium

The hypothesis that fetal gene modules are re-capitulated in failing (n=41 samples) and pathologically hypertrophied myocardial tissue (n=100 samples) was evaluated. These data are summarized in FIG. 3 and in Tables 1 and 2 (FIGS. 14 and 15, respectively). FIG. 14 (Table 1) is tabular data regarding over-represented biological processes according to gene ontology (GO) in the fetal module recapitulated in heart failure and hypertrophy according to an embodiment of the present invention. FIG. 15 (Table 2) is tabular data summarizing genes involved in protein catabolism biological process from fetal module recapitulated in heart failure and hypertrophy.

Shown in FIGS. 3A-E are charts and graphs demonstrating the reproducibility of developmental gene modules in heart failure and cardiac hypertrophy according to an embodiment of the present invention. The modules unique to fetal myocardium were evaluated for reproducibility in heart failure and cardiac hypertrophy. Cluster dendrograms and barplots describe module membership (according to clusters derived in the training fetal dataset). Shown in FIG. 3A is myocardium in murine heart failure. Shown in FIG. 3B is myocardium in murine cardiac hypertrophy. Bargraphs describe significance of module reproducibility in FIGS. 3C (heart failure) and 3D (cardiac hypertrophy). A horizontal line indicates cutoff for statistical significance (p<0.001 or −log(p value)>3). E.) Proportion of fetal gene modules recapitulated in heart failure only, cardiac hypertrophy only, both, or neither.

Seven (14%) of 50 fetal gene modules were reproduced in failing myocardium with p<0.001. Three (6%) of 50 fetal gene modules were significantly reproduced in pathological hypertrophy. One module (FIG. 4) composed of 246 genes was reproduced in both failing and hypertrophied myocardium. The hub gene for this module, dihydroepiandrosterone sulfotransferase (SULT2A1) encodes an enzyme with high levels of expression in heart, adrenal glands, and liver and catalyzes the conjugation of steroids and bile acids (Nakamura Y, Xing Y, Sasano H, Rainey W E. The mediator complex subunit 1 enhances transcription of genes needed for adrenal androgen production. Endocrinology. 2009; 150:4145-4153; Nowell S, Falany C N. Pharmacogenetics of human cytosolic sulfotransferases. Oncogene. 2006; 25:1673-1678).

Shown in FIG. 4 is the topology of gene coexpression module common to developing myocardium and hypertrophied and failing myocardium according to an embodiment of the present invention. Edges with adjacency weight>0.7 are shown for clarity. The proximity of each gene to the center of the figure indicates its connectivity, or sum of connection weights. Genes with connectivity<50 are omitted for clarity.

Transcription Factors are Significantly Over-Represented in Discovered Modules

Over-representation of transcription factor targets in each module were then analyzed. Targets of the transcription factor zinc finger protein 2 (ZIC2, product of gene ZIC2) were significantly over-represented in the fetal module common to both failing and hypertrophied myocardium (p=3×10⁻⁴) after correcting for multiple comparisons. Fetal modules reproduced in heart failure were enriched in targets of transcription factor HOXA5 (p=3×10⁻⁵) and fetal modules reproduced in hypertrophy were enriched in targets of the transcription factor FOXN1 (p=4×10⁻⁴).

Ontology Analysis Reveals Key Role of Metabolic Processes in Fetal and Failing Myocardium

Common biological process pathways were found among fetal and failing myocardium. In the fetal module reproduced in heart failure and cardiac hypertrophy, proteasomal protein catabolic processes (p=5.9×10⁻⁵, corrected p=0.085) and protein catabolic processes (p=4.2×10⁻⁴, corrected p=0.085) were overrepresented. Other biological processes overrepresented in gene modules common to developing and failing myocardium (n=7) were cytoskeleton organization and biogenesis (corrected p=8.5×10⁻³) and fatty acid oxidation (corrected p=9.2×10⁻³). Fetal modules recapitulated in hypertrophy (n=3) were enriched in genes involved in membrane lipid metabolic processes (corrected p=0.02). Other fetal modules with significant enrichment for biological processes were involved in stress response, nervous system development, biopolymer metabolism, and innate and humoral immunity (corrected p<0.05).

Higher Order Topological Overlap is Poorly Preserved Between Fetal and Failing Myocardium

Eigengene networks reveal connectivity patterns between modules, e.g., they allow one to study higher order topology of the transcriptome and relationships between biological pathways represented in each gene module. Derived eigengenes explained between 32% and 67% of the variance in gene expression for each module. Preservation of higher order topology of the module network is presented in FIG. 5 in both graphical form and as cluster dendrograms. The preservation of fetal inter-modular connections in failing and hypertrophied myocardial tissue was low (52% and 58%, respectively).

Shown in FIG. 5 are representations of the reproducibility of higher-order network topology between fetal myocardium and myocardium in murine heart failure and cardiac hypertrophy according to an embodiment of the present invention. Singular value decomposition was used to generate a representative eigengene for each fetal module that described most of the variance in modular gene expression. Hierarchical clustering of eigengenes was then used to identify inter-modular connections. Heatmaps describe the correlation between eigengenes and the dendrograms represent clustering of module eigengenes in FIGS. 5A (fetal myocardium), 5B (myocardium in murine heart failure), and 5C (myocardium in cardiac hypertrophy). The line of identity has correlation=1 for all modules. Graphical representation of eigengene networks is represented for fetal myocardium in FIG. 5D, heart failure in FIG. 5E, and cardiac hypertrophy in FIG. 5F. Values given for preservation refer to preservation of fetal inter-modular connections in heart failure and cardiac hypertrophy.

Discussion

In this study, network analysis tools were applied to the largest dataset of myocardial transcriptomic data assembled to date. Modular sets of genes with a high degree of coexpression in developing myocardial tissue were identified an evaluated. Their conservation was tested across different pathological states. Specially, the question of whether a coordinated fetal gene expression program is recapitulated in heart failure and hypertrophy was addressed.

In traditional differential gene expression analysis, investigators have applied statistical tools that were developed for evaluation of associations between a small number of predictors (e.g., clinical characteristics or medical interventions) and a large number of outcomes (e.g., mortality). However, these analytical approaches are often inadequate to describe the associations between the large number of gene expression signatures and relatively small number of phenotype experiments in genome-wide expression studies. Identification of a relatively small number of modules and subsequent module-wise, rather than gene-wise, analysis yields a lower false discovery rate than would be expected with more traditional analyses. Modular gene expression analysis also facilitated evaluation of higher-order topology of the transcriptome and associated transcriptional regulators in heart failure and developing myocardium.

Using these analytical tools, a significant proportion of modular gene programs were found to be either unique to fetal cardiac tissue or to be present but differentially expressed in normal adult myocardium. Analysis of the composition of these fetal gene modules corroborated prior evidence of differences in myocardial energetic pathways and sarcomeric composition between fetal and normal adult myocardium 2-5. Furthermore, fetal modules were identified that were involved in membrane lipid metabolism, nervous system development, and innate and humoral immunity whose importance to developing myocardium, if confirmed in experimental model systems, enriches the understanding of myocardial development.

It was found that key fetal pathways of myocardial energetics previously demonstrated to be recapitulated in myocardial adaptation (Hansson A, Hance N, Dufour E, Rantanen A, Hultenby K, Clayton D A, Wibom R, Larsson N G. A switch in metabolism precedes increased mitochondrial biogenesis in respiratory chain-deficient mouse hearts. Proc Natl Acad Sci USA. 2004; 101:3136-3141) were represented in gene modules that were preserved between fetal and failing and hypertrophied myocardium. Novel pathways that were shared between developing and failing/hypertrophied myocardium were also identified in the analysis of the present invention and may represent key stress response pathways important to both the fetal circulation and myocardial adaptation. It is yet unclear which of these pathways may represent maladaptive cardiovascular stress responses.

This set of recapitulated modules represented a small proportion of the identified fetal gene program and there was little overlap between fetal modules recapitulated in heart failure and those recapitulated in cardiac hypertrophy. These findings have important implications for the understanding of myocardial adaptation to injury. The hypothesis that a coordinated program of developmental gene expression is recapitulated in myocardial adaptation is accepted (Vanderheyden M, Mullens W, Delrue L, Goethals M, de Bruyne B, Wijns W, Geelen P, Verstreken S, Wellens F, Bartunek J. Myocardial gene expression in heart failure patients treated with cardiac resynchronization therapy responders versus nonresponders. J Am Coll Cardiol. 2008; 51:129-136). Results according to an embodiment of the present invention suggest that this model of myocardial adaptation may be inadequate to describe the genomic response to myocardial stress. Instead, the spectrum of myocardial adaptation in which compensatory hypertrophy progresses to myocardial failure appears to involve diverse and plastic pathways of genomic response which are not easily distilled to one coordinated program of fetal gene expression. Furthermore, it was found that the higher order organization of genomic response, represented by connections between modules of shared gene expression, was disparate between developing, failing and hypertrophied myocardium.

Several regulators of expression, including transcriptional activators and repressors, histone deacetylases, and microRNAs, have been shown to modulate transcription and translation of portions of the gene expression program of the developing heart and heart failure and hypertrophy (Thattaliyath B D, Livi C B, Steinhelper M E, Toney G M, Firulli A B. HAND1 and HAND2 are expressed in the adult-rodent heart and are modulated during cardiac hypertrophy. Biochem Biophys Res Commun. 2002; 297:870-875; Kuwahara K, Saito Y, Takano M, Arai Y, Yasuno S, Nakagawa Y, Takahashi N, Adachi Y, Takemura G, Horie M, Miyamoto Y, Morisaki T, Kuratomi S, Noma A, Fujiwara H, Yoshimasa Y, Kinoshita H, Kawakami R, Kishimoto I, Nakanishi M, Usami S, Saito Y, Harada M, Nakao K. NRSF regulates the fetal cardiac gene program and maintains normal cardiac structure and function. Embo J. 2003; 22:6310-6321; Trivedi C M, Luo Y, Yin Z, Zhang M, Zhu W, Wang T, Floss T, Goettlicher M, Noppinger P R, Wurst W, Ferrari V A, Abrams C S, Gruber P J, Epstein J A. Hdac2 regulates the cardiac hypertrophic response by modulating Gsk3 beta activity. Nat Med. 2007; 13:324-331; Thum T, Galuppo P, Wolf C, Fiedler J, Kneitz S, van Laake L W, Doevendans P A, Mummery C L, Borlak J, Haverich A, Gross C, Engelhardt S, Ertl G, Bauersachs J. MicroRNAs in the human heart: a clue to fetal gene reprogramming in heart failure. Circulation. 2007; 116:258-267; van Rooij E, Sutherland L B, Liu N, Williams A H, McAnally J, Gerard R D, Richardson J A, Olson E N. A signature pattern of stress-responsive microRNAs that can evoke cardiac hypertrophy and heart failure. Proc Natl Acad Sci USA. 2006; 103:18255-18260). Here, a novel association of the transcription factor ZIC-2 with gene coexpression modules common to myocardial development and adult myocardial adapation was identified. Loss of function mutations in this transcription factor are associated with a defect of forebrain development, holoprosencephaly, complex skeletal defects and congenital heart disease (Grinberg I, Millen K J. The ZIC gene family in development and disease. Clin Genet. 2005; 67:290-296). Modules common to developing and failing myocardium were enriched in targets of HOXA5, a transcription factor with known roles in lung development in mammals and apoptotic heart morphogenesis in amphibians (Kim C, Nielsen H C. Hoxa-5 in mouse developing lung: cell-specific expression and retinoic acid regulation. Am J Physiol Lung Cell Mol Physiol. 2000; 279:L863-871; Grier D G, Thompson A, Lappin T R, Halliday H L. Quantification of Hox and surfactant protein-B transcription during murine lung development. Neonatology. 2009; 96:50-60; Gaur A, Bhatia R, Spring-Mills E, Lemanski L F, Dube D K. The heart of metamorphosing Mexican axolotl but not that of the cardiac mutant is associated with the upregulation of Hox A5. Biochem Biophys Res Commun. 1998; 245:746-751). Targets of the transcription factor FOXN1, which is known to be involved in thymic development and lymphocyte regulation, were found in modules shared between developing and hypertrophied myocardium (Coffer P J, Burgering B M. Forkhead-box transcription factors and their role in the immune system. Nat Rev Immunol. 2004; 4:889-899). This transcription factor is involved in cardiomyocte-rich embryoid body development in pharyngeal endoderm and other members of the forkhead box transcription factor family have been implicated in human and murine heart failure (Hidaka K, Nitta T, Sugawa R, Schwartz R J, Amagai T, Nitta S, Takahama Y, Morisaki T. Differentiation of Pharyngeal Endoderm and Derivatives from Mouse Embryonic Stem Cells. Stem Cells Dev. 2010; 19:1735-1743; Hannenhalli S, Putt M E, Gilmore J M, Wang J, Parmacek M S, Epstein J A, Morrisey E E, Margulies K B, Cappola T P. Transcriptional genomics associates FOX transcription factors with human heart failure. Circulation. 2006; 114:1269-1276).

In an embodiment of the present invention, evidence is shown of the role of these transcription factors in modulation of the gene programs common to developing myocardium and myocardial adaptation. With experimental confirmation of the role of these transcription factors inmodulation of gene expression programs of the developing heart and identification of additional common transcription factor and microRNA targets, it may be possible to identify regulators of fetal gene networks that allow for manipulation of maladaptive portions of the program.

Myocardial expression data was evaluated from disparate sources which utilized different experimental model systems and different tissue preparation, transcript isolation, and microarray techniques. This can be considered a strength of the present invention. It is unlikely that a single model from one laboratory adequately captures the diversity of disease. By concentrating on the findings common to many different models and techniques from many different laboratories, focus can be placed on the most robust changes common to all. Further, normalization minimized between-array differences, while a module identification technique robust to noise was used with an iterative approach to validate modules.

Due to the data from human samples, an embodiment of the present invention concentrated on murine experimental models.

CONCLUSION

In an embodiment of the present invention, a gene expression topology of the developing and diseased heart is described. Key transcriptional regulators of the fetal gene expression program are identified. Novel module conservation algorithms are applied to address the question of the recapitulation of fetal gene markers and gene networks in heart failure and hypertrophy. While marker genes are clearly upregulated in gene networks common to developing and diseased myocardium, they are not found to be the most highly connected genes. In addition, evidence was found for a coordinated recapitulation of gene expression programs found in the developing heart that is recapitulated in myocardial adaptation.

There are over 1600 single-gene (“Mendelian” inherited disorders for which the molecular basis is unknown. An additional 1800 suspected Mendelian traits have unknown genetic basis. Current approaches to identifying the genetic basis of disease (positional cloning, linkage analysis) are costly and time consuming and have limited sensitivity for defining genetic syndromes. The advent of whole genome sequencing has allowed increased sensitivity for polymorphisms associated with disease, but there is a need for better prioritization of putative gene targets in families with inherited disease syndromes. The method described here leverages the wealth of information from gene expression, protein interaction, and genetic variation to prioritize genes for co-segregation analysis.

Hierarchical link clustering is computationally intensive. The present invention can be implemented with hierarchical clustering of genes based on topological overlap, a similarity measure that captures the degree of shared network neighbors. Based on results of “leave-one-out” cross-validation, the threshold can be modified for step length (degree of separation between network neighbors) in identification of genes related to known disease genes. In another embodiment, clustering phenotypes can be considered based on shared genetic and clinical features to enlarge the starting scaffold set of disease genes.

It should be appreciated by those skilled in the art that the specific embodiments disclosed above may be readily utilized as a basis for modifying or designing other algorithms or systems. It should also be appreciated by those skilled in the art that such modifications do not depart from the scope of the invention as set forth in the appended claims. 

What is claimed is:
 1. A computer-implemented method for interpreting genetic data, comprising: receiving a first set of genetic data from a first source; generating an interaction network from the first set of data; identifying at least one disease associated with at least one gene from the interaction network; receiving a second set of genetic data from a second source, wherein the second set of genetic data includes genetic variation information; identifying genetic variants within a predetermined path length within the network model; and prioritizing genetic variants responsive to the identified genetic variants.
 2. The method of claim 1, wherein the interaction network comprises a weighted graph that includes vertices that represent genes and edges that represent interactions between genes.
 3. The method of claim 2, further comprising creating at least one hierarchical cluster of edges of the graph.
 4. The method of claim 1, wherein the predetermined path length is less than or equal to
 2. 5. The method of claim 1, wherein the predetermined path length is less than or equal to
 3. 6. The method of claim 1, wherein prioritizing genetic variants includes prioritizing genetic information from genotyping arrays.
 7. The method of claim 1, wherein prioritizing genetic variants includes identifying genes that co-reside with predetermined disease genes.
 8. The method of claim 1, further comprising updating prior probabilities of pathogenicity of discovered variants.
 9. The method of claim 1, wherein the first source includes microarray data from normal and diseased model system tissues.
 10. The method of claim 1, wherein the first source includes data obtained from genome wide genotyping.
 11. A computer-readable medium including instructions that, when executed by a processing unit, cause the processing unit to interpret genetic data, by performing the steps of: receiving a first set of genetic data from a first source; generating an interaction network from the first set of data; identifying at least one disease associated with at least one gene from the interaction network; receiving a second set of genetic data from a second source, wherein the second set of genetic data includes genetic variation information; identifying genetic variants within a predetermined path length within the network model; and prioritizing genetic variants responsive to the identified genetic variants.
 12. The computer-readable medium of claim 11, wherein the interaction network comprises a weighted graph that includes vertices that represent genes and edges that represent interactions between genes.
 13. The computer-readable medium of claim 12, further comprising creating at least one hierarchical cluster of edges of the graph.
 14. The computer-readable medium of claim 11, wherein the predetermined path length is less than or equal to
 2. 15. The computer-readable medium of claim 11, wherein the predetermined path length is less than or equal to
 3. 16. The computer-readable medium of claim 11, wherein prioritizing genetic variants includes prioritizing genetic information from genotyping arrays.
 17. The computer-readable medium of claim 11, wherein prioritizing genetic variants includes identifying genes that co-reside with predetermined disease genes.
 18. The computer-readable medium of claim 11, further comprising updating prior probabilities of pathogenicity of discovered variants.
 19. The computer-readable medium of claim 11, wherein the first source includes microarray data from normal and diseased model system tissues.
 20. The computer-readable medium of claim 11, wherein the first source includes data obtained from genome wide genotyping.
 21. A computing device comprising: a data bus; a memory unit coupled to the data bus; a processing unit coupled to the data bus and configured to receiving a first set of genetic data from a first source; generating an interaction network from the first set of data; identify at least one disease associated with at least one gene from the interaction network; receive a second set of genetic data from a second source, wherein the second set of genetic data includes genetic variation information; identify genetic variants within a predetermined path length within the network model; and prioritize genetic variants responsive to the identified genetic variants. 